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Abstract. The paper deals with the problem of identifying the internal 
dependencies and similarities among a large number of random processes. Linear 
models are considered to describe the relations among the time series and the 
energy associated to the corresponding modeling error is the criterion adopted 
to quantify their similarities. Such an approach is interpreted in terms of graph 
theory suggesting a natural way to group processes together when one provides the 
best model to explain the other. Moreover, the clustering technique introduced in 
this paper will turn out to be the dynamical generalization of other multivariate 
procedures described in literature. 



1. Introduction 

Deriving information from data is a crucial problem in science and it has been widely 
investigated in literature. A large variety of contributions has been developed in many 
fields, such as engineering, physics, biology and economy, providing several methods 
and procedures which accomplish to different objectives [U E] |3l I5- In particular, in 
the study of complex systems, the comprehension of the internal connections, which 
define the hierarchical structure of the process, turns out to play a key role to fully 
understand its dynamics. This is especially true in presence of a multivariate data set, 
because this kind of samples is usually the result of a process intrinsically organized 
into interconnected subsystems [5]. Therefore, the recognition of the system structure 
is a critical step for the definition of a suitable model. In particular, a clusterization 
problem can be solved to divide the source data set into interconnected homogeneous 
groups describing different subsystems |0. This approach deals with the search of 
similarities and relations inside the original samples, trying to catch their internal 
connections and providing a schematic representation of hierarchies. Recently, new 
clustering techniques based on a correlation matrix have been proposed for the analysis 
of data sets made up by a large variety of time series 13 [8]. However, these procedures 
are able to detect only the "static" relations among the samples, since they capture 
the similarities just at the current time [9] HO] [TT] . 

In this paper, we propose a clustering technique based on a modeling approach. 
Indeed, since the original time series are dynamically interconnected, we intend to 
derive their hierarchy in terms of mathematical laws, which provide a structured 
description of the internal mechanics. To this aim, we settle the clustering problem 
into the framework of the system identification theory |121 IT3| . Hence, exploiting the 
modeling errors to quantify the similarities among the original signals, we realize a 
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clustering technique, defined as the solution of a minimization problem. Therefore, 
a modeling interpretation of the procedures based on the correlation matrix is first 
introduced. In particular, they turn out to be a non optimal choice with respect to 
the modeling error. Then, the approach is developed taking into account dynamic 
dependencies among the time series. To this regard, the identification step is realized 
introducing the hypothesis of linear dynamic connections, represented by Single Input- 
Single Output (SISO) local models. Moreover, since the clusters are internally 
organized by means of transfer functions, the final model can be interpreted as a 
dynamical network of interconnected systems and its structure as the related topology. 

Notation: 

E[-]: mean operator; 

Rxy{t) = E[X{t)Y{t + t)]: cross-covariance function of stationary processes; 

Rx{t) = Rxx{t): autocovariance; 

pxY = /^^^ : correlation index; 

Z{-): Zeta-transform of a signal; 

^xy{z) = Z{Rxy{t)): cross-power spectral density; 

^x{z) = ^xx{z): power spectral density; 

with abuse of notation, ^x{^) — *i'x(e*'^); 

[•] and [-J : ceiling and floor function respectively; 

(•)*: complex conjugate. 

2. A Modeling Perspective 

In 12 a procedure to obtain a hierarchical structure of a set of time series is proposed. 

realizations of N random processes Xi are considered. First, an estimation of the 
correlation index pij related to every couple {Xi, Xj) is computed, along with the 
associated distances (see j4]) 



Then, a graph is defined where every node represents a random process and the arc 
linking two nodes is weigthed according to Eventually, the Minimum Spanning 
Tree (MST) is extracted by the graph. This procedure has been successfully exploited 
to provide a quantitative and topological analysis of time series, expecially in the 
economic field (see |3], |8] and [11]). It is worth considering that such a technique can 
be interpreted in terms of a modeling procedure. Consider the problem of describing a 
process Xj by scaling another process Xi with a suitable real constant aji. Choosing 




(1) 




we find that 



E [{X, - a,,X,f] - E[X]] dl 



Hence, the distance ([T| can be interpreted as the root of the mean square error, 
properly normalized by the variance of Xj, when the simple gain ([2j is used. Such a 
normalization is necessary since we are interested into capturing similar trends between 
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the processes regardless of their amplitudes. However, we remark that the choice of 
(|2| can be considered arbitrary. Conversely, we would like to evaluate the closeness of 
two processes according to the information which can be inferred about one of them 
assuming to know the other |T3]. From this point of view, ([2} does not satisfy any 
optimality criterion. Indeed, considering two anticorrelated time series {pij = —1) it 
is possible to perfectly reconstruct one from the other. Thus the information in the 
two signals is the same, while their distance ([ij makes them the farest away. 
Let us define 



then, it is possible to adopt the least squares criterion in order to evaluate the "best" 
constant ajt. In this case, it is immediate to prove that the optimal choice is given by 



- o (4) 



and the relative quadratic error amounts to 



E[e%]^Rx,-^^ (5) 

(see [13]). In order to obtain an adimensional quantity, we can normalize ([5| with 
respect to the power of Xj and define the binary function 



d{X..X,) = ^ ^ = ^1-Pl,x, ■ (6) 

It is worth observing that ([6} is a distance exactly as 

Proposition 1. The function d{-, ■), as defined in is a metric. 

Proof. See the appendix. □ 
In |9], the MST is extracted from the graph, according to the weights ([iJ. This is 
equivalent to define a hierarchical structure of the time series relying on the adoption 
of linear gain models ^ between the processes and considering the relative modeling 
error as a distance function. 

Substituting ([T| with (|6}, we are applying the same topological strategy, but we are 
structuring the data according to the the best gain model in the sense of the least 
squares. 

Remark 2. From a system theory point of view, it can he said that both the approaches 
are "static". Indeed, the models do not have a state, thus they do not have any 
dynamics. They simply capture a direct relation between two process samples at the 
same time instant. However, the optimal approach we have followed can be extended 
to a more general case. 



3. Dynamic Modeling using Wiener Filters 

We consider a model to be "static" (or "memoryless") when, at every time instant t, 
its output is a function of its input at the very same time instant. Conversely, the 
output of a "dynamic" model also depends on the input values it receives at instants 
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different from t. In this general sense, we say that it has a "memory" (or, equivalently, 
a "state"). Constant gains as ([2]| or (|4j are Unear static models offering an extremely 
simple proportional relation between two processes. We propose a dynamic extension 
of the linear approach just described in the previous section based on the well-known 
Wiener Filter. 

Given two stochastic processes Xi, Xj and a time discrete transfer function Wji{z) 
(that is the zeta-transform of its impulse response), let us consider the quadratic cost 

E [{egf] (7) 

where 



eQ = Q{z){X,~W,,{z)X,) (8) 

being Q{z) an arbitrary stable and causally invertible time-discrete transfer function 
weighting the error 

ej,^Xj-Wj,{z)X,. (9) 

Then, the problem of evaluating the transfer function W{z) such that the quadratic 
cost ([t]) is minimized is well-known in scientific literature and its solution is referred 
to as the Wiener filter (see |13j. 

Proposition 3 (Wiener filter). The Wiener filter modeling Xj by Xi is the linear 
stable filter Wji minimizing the filtered quantity Its expression is given by 

«>M - ^ (10) 

and it does not depend upon Q{z). Moreover, the minimized cost is equal to 

mini? [(Q(z)e)2] = ^ £ |g(c.)p (<i>x,(c.) - \<i>x,xM\^'^x») doJ 

Proof. See, for example, [13] □ 
Observe that the stable implementation of the Wiener filter Wji{z) is non-causal, 
in general. That is, its output Wji{z)Xi depends on both past and future values of 
the input process Xi. The Wiener filter, in this formulation, is interesting from an 
information and modeling point of view, but, of course, we would rather need a causal 
filter, if we were to make predictions (aim which is beyond the scope of this paper). 
Since the weighting function Q{z) does not affect the Wiener filter, but only the energy 
of the filtered error, we choose Q{z) equal to Fj{z), the inverse of the spectral factor 
of (z), that is 

<i>xAz) = F^\z){Fr\z)r (11) 

with Fj{z) stable and causally invertible (see [15]). In such a case the minimum cost 
assumes the value 



mini.[4j^^£(l 



duj. (12) 
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This peculiar choice of Q{z) makes the cost depend expHcitly on the coherence function 
of the two processes 



1$ 



(13) 



which turns to be non negative and symmetric with respect to uj. It is also well-known 
that the cross-spectral density satisfies the Schwartz inequality. Hence, the coherence 
function is limited between and 1. The choice Q{z) — Fj{z) can be now understood 
as motivated by the necessity to achieve an adimensional cost function not depending 



on the power of the signals as in (12 1. 



The cost obtained by the minimization of the error ep- using the Wiener filter as 
before allows us to define the binary function 



d{Xi , Xj 



27T 



1/2 



(14) 



Proposition 4. The function d{-, ■), as defined in is a metric. 

Proof. See the appendix. □ 
The metric (14 1 can now be used to derive a MST and obtain a hierarchical 
structure of the processes X^. Such an approach generalizes the results in |9l to the 
linear dynamic case. We remark that the choice of a tree to describe the topology of 
the data is a very reasonable but arbitrary solution. In order to capture influences and 
similarities among the processes Xj, we intend to propose a more flexible modeling 
technique to extract topological information from the data. Every Xj can be described 
as the output of a linear SISO dynamical system, whose input is one of the other N—1 
processes. Thus, for every time series Xj it is natural to choose the model Wj^u) with 



input X„j(j-), such that it provides the best description according to ( 12 1, dropping the 
others. The application of this procedure results in a set N interconnected systems, 
each of them minimizing min^ i?[(Qjeji)^]. Since the choice of every model Wjra(j) 
does not affect the selection of the other ones, the overall cost function 



m(-) ^ 



(15) 



turns out to be minimized, as well. The following algorithm performs such a task. 



Clusterization Algorithm: 

1 . initialize the set A = 

2 . for every process Xj (j/ = 1, N) 
2a . for every i — I, N,i ^ j 

compute the distance dij = d{Xi,Xj); 
2b. define the set M{j) = {k\dkj = min^ dij} 
2c. choose, if possible, m(j) S M{j) such that {'m{j),j) ^ A 
2d. choose the model 

Xj = Wj,n(^j){z)X^(^j) + ejm(j) 

2e. add the couple {j,m{j)) to A. 



Figure 1. The figures illustrates all the possible connections between two nodes 
(dashed lines) in a nine-node network. The solid lines depict a forest as it were 
the result after the application of the algoritm A. 



Figure 2. The figure illustrates all the possible connections between two nodes 
(dashed lines) in a 10 nodes network. The solid lines depict a forest as it were the 
result after the application of the clusterization algorithm. 



The resulting network of processes has an appealing graphical interpretation. Indeed, 
its topological structure can be seen as a weigthed graph where every process Xj is a 
node, the arc linking Xi to Xj represents the Wiener Filter describing the "output" Xj 



in terms of the "input" Xi and the weights on the arcs are given by (14 1. Because of 



the simmetry property of (14 1, there is no actual need to consider an oriented graph. 
Hence, the presence of both the arcs (z, j) and (j, i) boils down into just a single link. 
Following this interpretation, the algorithm determinates a graph designed to keep, 
for every node, the incident arc with the least cost (see Figure [ij. 

Proposition 5. The graph resulting from the proposed algorithm has the following 
properties: 

• on every node there is at least an incident arc 

• if there is a cycle, then all the arcs of the cycle have the same weight 

• there are at least \N/2\ and at most N arcs. 

Proof. See the appendix. □ 
The presence of cycles in the resulting graph is a pathological situation as stressed 
in the following remark. 

Remark 6. A necessary condition of existence for a cycle is the presence of more 
than two nodes with common multiple minimum cost arcs. Therefore, a mild sufficient 
condition in order to avoid cycles in the graph is to assume that every node has a unique 
minimum cost arc. If the costs of the arcs are obtained by estimation from real data 
the probability to obtain a cycle is zero almost everywhere (see flW). Consequently, in 
such a case the expected topology of the graph is a forest (a graph with no cycles). 
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Figure 3. The figure illustrates the topology of the 10 nodes network analyzed in 
the numerical examples paragraph. Each node represent a process Xj, while the 
arcs describe the connections among them, according to the linear SISO model 
([Te}. For the data generation we have considered only transfer functions of at 
most the second order. The noises Nj have been assumed to provide half the 
power of the affected processes. The samples have been collected over 1000 time 
steps. 



Remark 7. If there are no cycles, the graph resulting from the algorithm is a subgraph 
of the MST. 

Remark 8. In general, nothing can be said about the connectivity. Therefore, the 
modeling procedure depicted by the algorithm provides a clusterization of the original 



processes Xi which, for every node, minimizes the cost ( 14 ) according to the criterion 
of linear dynamic dependency. It is possible to modify the procedure in order to 
suitably satisfy other constraints about the graph topology. For instance, if we deal 
with a connectivity condition the algorithm can be easily replaced by a MST search. 
Therefore, the approach followed in l9j to obtain topological information from the time 



series results in a constrained optimization of {15). 

Remark 9. The modelization we have derived makes use of non causal Wiener filters, 
thus it can be useful to detect linear dependencies of any sort between the processes 

Unfortunately, the adoption of non causal filters can not be employed to make 
predictions. 



4. Numerical Example 

It is intended to show, by means of numerical examples, the main advantages of 
the technique described in the previous sections. In particular, we want to evaluate 
the performance of our procedure when identifying an unknown topology. First, 
we realized several simulations of 10 randomly generated processes Xj, designed 
as follows. They have been hierarchiacally structured in a tree topology, where 
the interconnections were linear, randomly generated, at most second order transfer 
functions Wji with external noise Nj. 

= Wj,{z)X, + N, (16) 

Since all simulations present strong analogies, we are showing just one of them, 
whose topology is depicted in Figure |3] Note that the simulated network involves 
linear dependencies only, so it satisfies the theoretical conditions of the approach 
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Figure 4. The MST obtained using tiie correlation-based distance of Tabled . 
The actual topology has not been correctly identified, though some analogies with 
the right structure can be observed. The procedure described in reveals strong 
limitation in capture the nature of the network even when the actual topology is 
exactly a tree. 



based on Wiener filters introduced in the previous sections. On every node Xj 
(but the root) the deterministic component Wji{z)Xi and the stochastic disturbance 
are equal in power. A simulation horizon of 1000 steps has been taken into 
account where the noise components have been generated by pseudo random number 
algorithms. The incorrelation hypothesis among the disturbances has been numerically 
checked providing a marginally satisfactory result. Applying the correlation technique 
described in |3], we found the distance matrix reported in Table [l] and the 
corresponding MST depicted in Figure [7| We note that the topology is not correctly 
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0.9946 
1.3763 
1.0624 
1.1027 
1.2393 
1.2719 
1.3747 
0.7306 
1.4589 



0.9946 
3.3e-8 
1.1130 
1.0674 
0.7723 
1.0082 
1.2004 
1.1269 
1.1132 
1.4575 



1.3763 
1.1130 


1.1487 
1.2217 
1.2877 
1.1645 
0.9965 
1.3507 
1.4124 



1.0624 
1.0674 
1.1487 
4.2e-8 
1.1727 
1.1805 
0.9296 
1.1455 
1.1491 
1.3433 



1.1027 
0.7723 
1.2217 
1.1727 
3.9e-8 
1.1491 
1.2418 
1.2353 
1.1898 
1.4587 



1.2393 
1.0082 
1.2877 
1.1805 
1.1491 
4.9e-8 
1.2123 
1.2984 
1.2858 
1.3227 



1.2719 
1.2004 
1.1645 
0.9296 
1.2418 
1.2123 


1.1815 
1.3003 
1.3334 



1.3747 
1.1269 
0.9965 
1.1455 
1.2353 
1.2984 
1.1815 


1.3542 
1.4389 



0.7306 
1.1132 
1.3507 
1.1491 
1.1898 
1.2858 
1.3003 
1.3542 
7.3e-8 
1.4450 



1.4589 
1.4575 
1.4124 
1.3433 
1.4587 
1.3227 
1.3334 
1.4389 
1.4450 




Table 1. Correlation-based distance matrix. 



identified by such a procedure, even though similarities can be identified. On the 
other hand, the application of the clusterization algorithm introduced by us provides 
the distances of Table [2] with the graph of Figure [s] We stress that the topology 
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Table 2. Coherence-based distance matrix. 



is perfectly reconstructed by the procedure if the connectivity constraint is imposed. 
Further, we repeated the same procedure with a larger number of processes {N = 50). 
Again, results showed many similarities, so we are presenting just one case with the 
topology depicted in Figure |6] Analogously, the correlation-based approach provides 



9 




Figure 5. The figure illustrates the MST obtained by using the coherence-based 
distance (solid+dashed lines). Notably, it is the same of the actual topology. The 
application of the proposed clustering algorithm provides a forest (solid lines): 
each cluster is virtually connected to the others by the arcs of the MST, which 
have not been chosen by the algorithm (dashed lines). The use of different colors 
(online) highlights the modular structure resulting from the clusterization. The 
presence of a very high noise-to-signal ratio prevents the algorithm to correctly 
reconstruct the actual topology, if no connectivity constraint is given. 



Figure 6. The 50 nodes network of the numerical examples paragraph. The 
figure provides the actual topology. The example has been designed according to 
the same assumptions of the network of Figure |6] 
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Figure 7. The figure illustrates the MST obtaining by means of the correlation- 
based distance. Though the original topology is a tree, a quite significative 
amount of connections have not been correctly reconstructed. A limited number 
of similarities with the actual network can be observed. 

the MST of Figure [7] while the coherence-based algorithm identifies the graph of 
Figure |8] It is worth noting that our technique detects links actually present in 
the topology. However the presence of a low signal to noise ration prevents the 
complete reconstruction of the original tree topology in the absence of any connectivity 
constraints. These simple examples highlight a better capability of our technique into 
capturing relationships and dependencies among time series. In particular, remarkable 
improvements should be expected in presence of strong dynamical interconnections 
and significative delays in the actual network. 

5. Conclusions 

In this paper we have introduced a novel approach to the clusterization problem. 
In particular, the similarities among the time series of a multivariate data set have 
been analyzed from the modeling point of view and their interconnections have been 
interpreted as functional dependencies. Hence, linear SISO transfer functions have 
been proposed to describe the relations among the processes and the associated 
modeling errors have been exploited to quantify their similarities. In turn, such a 
distance introduces a natural way of grouping the time series, since it is very reasonable 
to place two processes in the same subset, when one provides the best model to explain 
the other. Notably, the proposed distance can be directly computed exploiting the 
coherence function, requiring no identification step. Further, our novel approach 
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Figure 8. The figure show the MST (solid+dashed lines) obtained by applying 
the coherence-based distance ( [16) to the processes produced by the network 
depicted in Figure |6] Notably, the original structure has been correctly 
reconstructed. The forest resulting from the application of the clusterization 
algorithm is also reported (solid lines). The clusters (colors online) result 
connected by the remaining arc of the MST (dashed lines). The algorithm does 
not manage to reconstruct exactly one cluster, because of the high noise-to-signal 
ratio. 

has been compared to the clustering technique proposed in [4] and formulated as 
an extension of the multivariate analysis of 0. In particular, our coherence-based 
distance turns out to be the dynamical generalization of the correlation-based metric 
in |9]. Therefore, it provides an improved capability in capturing the internal topology 
among the processes, especially when their functional dependencies turns out to be 
dynamical laws. Some numerical examples have finally been presented to illustrate 
the expected improvements, due to our distance, and to provide a validation for our 
clustering algorithm. 

6. Appendix 

Proof of Proposition^. Note that we consider two processes be equivalent also when 
they are anticorrelated since they are identical from an information point of view. 
Thus, the only non trivial property to show is the triangle inequality. Consider the 
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following relations involving the optimal gains a^i, 6(32, oli\ 

— OLzi^i + 631 

— OLZ2^2 + 632 
X2 = OL2\Xi + 621 • 

Since ol-h is the best constant model, we have that it must perform better than any 
other constant model (in particular (3!32Q;2i) 

Rx, - < E[{e,2 + «32e2i)'] < (v^iSi2] + l«32|/B[^ 

Normalize with respect to and consider the square root 



1 



1 - P^,X3 ^ y ( V ^[-^-32] + i«32i v^i^^^i] ) < 



2 



_ ^[^32],, I E[elr] 

- + IPXsXaU/^ • 

Since IPX2X3I < 1, we have the assertion. □ 
Proof of Proposition |7] The only non trivial property to prove is the triangle 
inequality. Let Wji{z) be the Wiener filter between Xi,Xj computed according to 
(10 1 and eji the relative error. The following relations hold: 

^3 = 1^31(^)^1+631 

-'^3 = T^32(z)-'^2 + 632 
X2 = ^^21(2)^1 +621. 

Since W3i(2;) is the Wiener filter between the two processes Xi and X3, it performs 
better at any frequency than any other linear filter, such as W32{z)W2i{z) . So we have 



<^eM < $e3.(^) + \W32{Lu)\^<i>eM + 

+ ^e3.e.i(^)W^3*2('^) + W32{uj)^ e,,e,A^) < 

< iV^cM + \W32iL0)W'i>eMf ^LoeR. 

For the sake of simplicity we neglect to explicitly write the argument uj in the following 
passages. Normalizing with respect to ^Xs, we find 

|j<^(y^+|1^32|y^f 

and considering the 2-norm properties 



— TT 
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where we have substituted the expression of W32- Finally, considering that 

< ' -^^"l < 1, 

we find 

d{Xi,X3) <d{Xi,X2) + diX2,X3). 

□ 

Proof of Proposition^The proof of the first property is straightforward because 
for every node the algorithm considers an incident arc. Let us suppose there is a cycle 
and be k the number of nodes ni, ...,11^ and arcs ai, of such a cycle. Every arc 
ai, ...,afc has been chosen at the step 2e when the algorithm was taking into account 
one of the nodes ni, Conversely, every node ni,...,nk is also responsible for one 

of the arcs ai, Ofc. Indeed, if a node rii causes the selection of an arc d ^ {ai, ak}, 
then we are left with the k arcs which cannot all be chosen by fc — 1 nodes. 
Let us consider the node ni. Without loss of generality assume that it is responsible 
for the selection of the arc oi with weight di and linking it to the node n2- According 
to the previous results, 77.2 can not be responsible for the choice of oi. Let 02 be 
the arc selected because of n2 with weight d2 and connecting it to n^. Observe that 
necessarily d2 < di. We may repeat this process till the node Uk-i- Hence, we obtain 
that every node is connected to n^+i by the arc whose cost is di < for 
i — 2,...,k — 1. Finally consider n^. It must be responsible for which has to 
connect it to ni with cost dk < dk^i- Since dk is incident to ni it holds that di < dk 
Therefore di < dk < dk-i--- < ^2 < f^i and we have the assertion of the second 
property. 

About the third property, the upper bound N follows from the consideration that 
every node causes the choice of at most a new arc. In step 2c of the algorithm, it may 
happen at most [-/V/2J times that we are forces to pick up an arc which is already in 
A. So we have at least N - \N/2] = [N/2\ arcs □ 
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